Summary of Program for Reporting Period


Program  Objectives 


To develop practical, low cost, real time methods for
suppressing noise which has been acoustically added to
speech.

To  demonstrate  that  through  the  incorporation  of  the 
noise  suppression  methods,  speech  can  be  effectively 
analysed  for  narrow  band  digital  transmission  in  practical 
operating  environments. 


Summary  of  Tasks  and  Results 


Introduction 


This semiannual technical report describes the status at the end of
September 1978 as the result of work performed during the period
1 April 1978 through 30 September 1978. This is the last technical
report to be issued under contract N00173-77-C-0041. Continuing
research is still being pursued under ARPA order 3301 and will be
reported semi-annually under contract with Naval Research
Laboratories. The next report is planned for the period 1 October 78
through 31 March 79 under successor contract N00173-79-C-0045.

Suppression of Acoustic Noise in Speech Using Spectral Subtraction


Steven  F.  Boll 


Abstract 

A stand-alone noise suppression algorithm is presented
for reducing the spectral effects of acoustically added
noise in speech. Effective performance of digital speech
processors operating in practical environments may require
suppression of noise from the digital waveform. Spectral
subtraction offers a computationally efficient,
processor-independent approach to effective digital speech analysis.
The method, requiring about the same computation as
high-speed convolution, suppresses stationary noise from
speech by subtracting the spectral noise bias calculated
during non-speech activity. Secondary procedures are then
applied to attenuate the residual noise left after
subtraction. Since the algorithm resynthesizes a speech
waveform, it can be used as a preprocessor to narrow band
voice communications systems, speech recognition systems, or
speaker authentication systems.

Application of Adaptive Noise Cancellation to Noise Reduction in Audio Signals


Dennis  C.  Pulsipher 


Abstract 

The  LMS  Adaptive  Noise  Cancellation  algorithm  has  been 
applied  to  the  removal  of  high-level  white  noise  from  audio 
signals.  Simulations  and  actual  acoustically  recorded 
signals  have  been  processed  successfully,  with  excellent 
agreement  between  the  results  obtained  from  simulations  and 
the  results  obtained  with  acoustically  produced  data.  A 
study  of  the  filter  length  required  in  order  to  achieve  a 
desired  noise  reduction  level  in  a hard-walled  room  is 
presented. The performance of the algorithm in this
application is described and required modifications are
suggested.

A multi-channel  processing  scheme  is  presented  which 
allows  the  adaptive  filter  to  converge  at  independent  rates 
in  different  frequency  bands.  This  is  shown  to  be  of 
particular  use  when  the  interfering  noise  is  not  white. 
Careful  implementation  of  the  scheme  allows  the  problem  to 
be  broken  into  several  smaller  ones  which  can  be  handled  by 
independent  processors,  thus  allowing  longer  filter  lengths 
to be processed in real time.

This abstract is taken from the Ph.D dissertation of
Dennis Pulsipher. This dissertation will be published as a
stand-alone technical report.


Estimation  of  the  Parameters  of  an  Autoregressive 
Process  in  the  Presence  of  Additive  White  Noise 

William  Done 

Abstract 

Applications of linear prediction (LP) algorithms have
been successful in modeling various physical processes. In
the area of speech analysis this has resulted in the
development of LP vocoders, devices used in digital
speech communication systems. The LP algorithms used in
speech  and  other  areas  are  based  on  all-pole  models  for  the 
signal  being  considered.  With  white  noise  excitation  to  the 
model,  the  all-pole  LP  model  is  equivalent  to  the 
autoregressive  (AR)  model. 

With  the  success  of  this  model  for  speech  well 
established,  the  application  of  LP  algorithms  in  noisy 
environments  is  being  considered.  Existing  LP  algorithms 
perform poorly in these conditions. Additive white noise
severely affects the intelligibility and quality of speech
after analysis by an LP vocoder.

It is known that the addition of white noise to an AR
process produces data that can be described by an
autoregressive moving-average (ARMA) model. The AR
coefficients of the ARMA model are identical to the AR
coefficients of the original AR process. This dissertation
investigates the practicality of this model for estimating
the coefficients of the original AR process. The
mathematical details for this model are reviewed. Those for
the autocorrelation method LP algorithm are also discussed.

Experimental results obtained from several parameter
estimation techniques are presented. These methods include
the autocorrelation method for LP and a Newton-Raphson
algorithm which estimates the ARMA parameters from the noisy
data. These estimation methods are applied to several AR
processes degraded by additive white noise. Results show
that using an algorithm based on the ARMA model for the data
improves the estimates for the original AR coefficients.


This  abstract  is  taken  from  the  Ph.D  dissertation  of 
William  Done.  This  dissertation  will  be  published  as  a 
stand-alone  technical  report. 

Nonparametric Rank-Order Statistics Applied to Robust
Voiced-Unvoiced-Silence Classification


B.V.  Cox  and  L.K.  Timothy 

Abstract 

This paper describes a theoretical and experimental
investigation for detecting the presence of speech in
wide-band noise. A robust algorithm forming the
voiced-unvoiced-silence decision is described. This
algorithm is based on a nonparametric statistical
signal-detection scheme that does not require a training set
of data and maintains a constant false alarm rate for a
broad class of noise inputs. Two nonparametric decision
procedures are investigated, the Kruskal-Wallis and the
multiple use of the two-sample Savage statistic. The
performances of these detectors are evaluated and compared
to that obtained from manually classifying twenty recorded
utterances. In limited testing, the average probability of
misclassification of voiced speech for the Savage case was
less than 6, 13, 28, and 55 percent, corresponding to
signal-to-noise ratios of 30, 20, 10, and 0 dB,
respectively.
Abstract 


A stand-alone noise suppression algorithm is presented for reducing
the spectral effects of acoustically added noise in speech. Effective
performance of digital speech processors operating in practical environments
may require suppression of noise from the digital waveform. Spectral
subtraction offers a computationally efficient, processor-independent
approach to effective digital speech analysis. The method, requiring
about the same computation as high-speed convolution, suppresses stationary
noise from speech by subtracting the spectral noise bias calculated during
non-speech activity. Secondary procedures are then applied to attenuate
the residual noise left after subtraction. Since the algorithm resynthesizes
a speech waveform, it can be used as a preprocessor to narrowband voice
communications systems, speech recognition systems, or speaker authentication
systems.


I.  Introduction 


Background  noise  acoustically  added  to  speech  can  degrade  the 
performance  of  digital  voice  processors  used  for  applications  such  as 
speech  compression,  recognition,  and  authentication  [1]  [2].  Digital 
voice  systems  will  be  used  in  a variety  of  environments  and  their  performance 
must  be  maintained  at  a level  near  that  measured  using  noise-free  input 
speech.  To  insure  continued  reliability,  the  effects  of  background  noise 
can  be  reduced  by  using  noise  cancelling  microphones,  internal  modification 
of  the  voice  processor  algorithms  to  explicitly  compensate  for  signal 
contamination,  or  preprocessor  noise  reduction. 

Noise cancelling microphones, although essential for extremely high
noise environments such as the helicopter cockpit, offer little or no
noise reduction above 1 kHz [3] (see Figure IV.2). Techniques available
for voice processor modification to account for noise contamination
are being developed [4], [5]. But due to the time, effort, and money
spent on the design and implementation of these voice processors [6],
[7], [8], there is a reluctance to internally modify these systems.

Preprocessor  noise  reduction  [12],  [21]  offers  the  advantage  that  noise 
stripping  is  done  on  the  waveform  itself  with  the  output  being  either  digital 
or  analog  speech.  Thus  existing  voice  processors  tuned  to  clean  speech 
can continue to be used unmodified. Also, since the output is speech,
the noise stripping becomes independent of any specific subsequent speech
processor implementation (it could be connected to a CCD channel vocoder
or a digital LPC vocoder).

The objectives of this effort were to develop a noise suppression
technique, implement a computationally efficient algorithm, and test
its performance in actual noise environments. The approach used was to
estimate  the  magnitude  frequency  spectrum  of  the  underlying  clean  speech 
by  subtracting  the  noise  magnitude  spectrum  from  the  noisy  speech 
spectrum.  This  estimator  requires  an  estimate  of  the  current  noise 
spectrum.  Rather  than  obtain  this  noise  estimate  from  a second 
microphone  source  [9],  [10],  it  is  approximated  using  the  average  noise 
magnitude  measured  during  non-speech  activity.  Using  this  approach, 
the  spectral  approximation  error  is  then  defined  and  secondary  methods 
for  reducing  it  are  described. 

The noise suppressor is implemented using about the same amount
of computation as required in a high-speed convolution. It is tested on
speech recorded in a helicopter environment. Its performance is measured
using the Diagnostic Rhyme Test (DRT) [11], and is demonstrated using
isometric plots of short-time spectra.

The  paper  is  divided  into  sections  which  develop  the  spectral 
estimator,  describe  the  algorithm  implementation,  and  demonstrate  the 
algorithm  performance. 

II. Subtractive Noise Suppression Analysis

A.  Introduction 

This section describes the noise suppressed spectral estimator.
The estimator is obtained by subtracting an estimate of the noise
spectrum from the noisy speech spectrum. Spectral information
required to describe the noise spectrum is obtained from the signal
measured during non-speech activity. After developing the spectral
estimator, the spectral error is computed and four methods for reducing
it are presented.

The following assumptions were used in developing the analysis.
The background noise is acoustically or digitally added to the speech.
The background noise environment remains locally stationary to the
degree that its spectral magnitude expected value just prior to speech
activity equals its expected value during speech activity. If the
environment changes to a new stationary state, there exists enough
time (about 300 ms) to estimate a new background noise spectral magnitude
expected value before speech activity commences. For the slowly varying
nonstationary noise environment, the algorithm requires a speech activity
detector to signal the program that speech has ceased and a new noise
bias can be estimated. Finally, it is assumed that significant noise
reduction is possible by removing the effect of noise from the magnitude
spectrum only.

Speech, suitably lowpass filtered and digitized, is analyzed by
windowing data from half-overlapped input data buffers. The magnitude
spectra of the windowed data are calculated and the spectral noise bias
calculated during non-speech activity is subtracted off. Resulting
negative amplitudes are then zeroed out. Secondary residual noise
suppression is then applied. A time waveform is recalculated from the
modified magnitude. This waveform is then overlap added to the previous
data to generate the output speech.

B.  Additive  Noise  Model 

Assume that a windowed noise signal n(k) has been added to a windowed
speech signal s(k), with their sum denoted by x(k). Then

$$x(k) = s(k) + n(k)$$

Taking the Fourier transform gives

$$X(e^{j\omega}) = S(e^{j\omega}) + N(e^{j\omega})$$

where $x(k) \leftrightarrow X(e^{j\omega})$,

$$X(e^{j\omega}) = \sum_{k=0}^{L-1} x(k)\, e^{-j\omega k}$$

$$x(k) = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega k}\, d\omega$$

C. Spectral Subtraction Estimator

The spectral subtraction filter $H(e^{j\omega})$ is calculated by replacing the
noise spectrum $N(e^{j\omega})$ with spectra which can be readily measured. The
magnitude $|N(e^{j\omega})|$ of $N(e^{j\omega})$ is replaced by its average value, $\mu(e^{j\omega})$,
taken during non-speech activity, and the phase $\theta_N(e^{j\omega})$ of $N(e^{j\omega})$ is
replaced by the phase $\theta_x(e^{j\omega})$ of $X(e^{j\omega})$. These substitutions result in
the spectral subtraction estimator, $\hat{S}(e^{j\omega})$:

$$\hat{S}(e^{j\omega}) = \left[\,|X(e^{j\omega})| - \mu(e^{j\omega})\,\right] e^{j\theta_x(e^{j\omega})}$$

or

$$\hat{S}(e^{j\omega}) = H(e^{j\omega})\, X(e^{j\omega})$$

with

$$H(e^{j\omega}) = 1 - \frac{\mu(e^{j\omega})}{|X(e^{j\omega})|}$$

$$\mu(e^{j\omega}) = E\{|N(e^{j\omega})|\}$$
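
As a minimal sketch of the estimator in this section, the following NumPy fragment applies the gain H = 1 - mu/|X| to the DFT of one windowed frame and checks that the result equals the magnitude-subtracted spectrum carrying the noisy phase. The function name `spectral_subtract_frame`, the `eps` guard against division by zero, and the synthetic input are illustrative assumptions and do not come from the report.

```python
import numpy as np

def spectral_subtract_frame(x_frame, mu, eps=1e-12):
    """Return S_hat = H * X with H = 1 - mu / |X| (no rectification yet)."""
    X = np.fft.fft(x_frame)
    mag = np.abs(X)
    H = 1.0 - mu / np.maximum(mag, eps)  # eps is an assumed numerical guard
    return H * X

# Synthetic stand-ins for a windowed frame and a noise bias (assumptions).
rng = np.random.default_rng(0)
x = rng.standard_normal(256)
mu = np.full(256, 2.0)

S_hat = spectral_subtract_frame(x, mu)

# Same estimator written as (|X| - mu) * exp(j * theta_x).
X = np.fft.fft(x)
expected = (np.abs(X) - mu) * np.exp(1j * np.angle(X))
```

Writing the estimator as a gain applied to X makes it explicit that only the magnitude is modified and the phase of the noisy input is reused.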


D. Spectral Error

The spectral error $\varepsilon(e^{j\omega})$ resulting from this estimator is given
by

$$\varepsilon(e^{j\omega}) = \hat{S}(e^{j\omega}) - S(e^{j\omega}) = N(e^{j\omega}) - \mu(e^{j\omega})\, e^{j\theta_x}$$

A number of simple modifications are available to reduce the auditory
effects of this spectral error. These include: (1) magnitude averaging;
(2) half-wave rectification; (3) residual noise reduction; and (4) additional
signal attenuation during non-speech activity.


E.  Magnitude  Averaging 

Since the spectral error equals the difference between the noise
spectrum $N$ and its mean $\mu$, local averaging of spectral magnitudes can
be used to reduce the error. Replacing $|X(e^{j\omega})|$ with $\overline{|X(e^{j\omega})|}$ where

$$\overline{|X(e^{j\omega})|} = \frac{1}{M} \sum_{i=0}^{M-1} |X_i(e^{j\omega})|$$

$X_i(e^{j\omega})$ = i-th time-windowed transform of x(k)

gives

$$\hat{S}_A(e^{j\omega}) = \left[\,\overline{|X(e^{j\omega})|} - \mu(e^{j\omega})\,\right] e^{j\theta_x(e^{j\omega})}$$

The rationale behind averaging is that the spectral error becomes approximately

$$\varepsilon(e^{j\omega}) = \hat{S}_A(e^{j\omega}) - S(e^{j\omega}) \approx \overline{|N|} - \mu$$

where

$$\overline{|N(e^{j\omega})|} = \frac{1}{M} \sum_{i=0}^{M-1} |N_i(e^{j\omega})|$$

Thus the sample mean of $|N(e^{j\omega})|$ will converge to $\mu(e^{j\omega})$ as a longer
average is taken.

The obvious problem with this modification is that the speech is
nonstationary and therefore only limited time averaging is allowed.
DRT results show that averaging over more than three half-overlapped
windows with a total time duration of 38.4 ms will decrease intelligibility.
Spectral examples and DRT scores with and without averaging are given
in the results section. Based upon these results, it appears that averaging
coupled with half-wave rectification offers some improvement. The major
disadvantage of averaging is the risk of some temporal smearing of short
transitory sounds.
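
A short sketch of magnitude averaging over M frames is given here. It assumes a trailing window (the current frame and the M - 1 frames before it); the report does not state whether the average is trailing or centered, so that choice, the function name `average_magnitudes`, and the toy data are assumptions.

```python
import numpy as np

def average_magnitudes(mags, M=3):
    """Average |X_i| over the M most recent frames (fewer at the start).

    mags: array of shape (num_frames, num_bins) holding |X_i(e^{jw})|.
    """
    mags = np.asarray(mags, dtype=float)
    out = np.empty_like(mags)
    for i in range(mags.shape[0]):
        start = max(0, i - M + 1)          # trailing window (assumption)
        out[i] = mags[start:i + 1].mean(axis=0)
    return out

# Four frames, two frequency bins of made-up magnitudes.
mags = np.array([[3.0, 0.0],
                 [6.0, 3.0],
                 [0.0, 6.0],
                 [3.0, 3.0]])
avg = average_magnitudes(mags, M=3)
```

With M = 3 each output frame mixes information from three analysis windows, which is the source of the temporal smearing mentioned above.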


F.  Half-Wave  Rectification 


For each frequency $\omega$ where the noisy signal spectrum magnitude
$|X(e^{j\omega})|$ is less than the average noise spectrum magnitude $\mu(e^{j\omega})$, the
output is set to zero. This modification can be simply implemented
by half-wave rectifying $H(e^{j\omega})$. The estimator then becomes

$$\hat{S}(e^{j\omega}) = H_R(e^{j\omega})\, X(e^{j\omega})$$

where

$$H_R(e^{j\omega}) = \frac{H(e^{j\omega}) + |H(e^{j\omega})|}{2}$$

The input-output relationship between $X(e^{j\omega})$ and $\hat{S}(e^{j\omega})$ at each frequency
$\omega$ is shown in Figure II.1.

Thus the effect of half-wave rectification is to bias down the
magnitude spectrum at each frequency $\omega$ by the noise bias determined at
that frequency. The bias value can of course change from frequency
to frequency as well as from analysis time window to time window. The
advantage of half-wave rectification is that the noise floor is reduced by
$\mu(e^{j\omega})$. Also any low variance coherent noise tones are essentially
eliminated. The disadvantage of half-wave rectification can exhibit itself
in the situation where the sum of the noise plus speech at a frequency
$\omega$ is less than $\mu(e^{j\omega})$. Then the speech information at that frequency
is incorrectly removed, implying a possible decrease in intelligibility.
As discussed in the section on results, for the helicopter speech data
base this processing did not reduce intelligibility as measured using
the DRT.
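
The rectified gain can be sketched in a few lines. The fragment below forms H_R = (H + |H|)/2 from magnitudes and shows that it zeroes every bin where |X| does not exceed the noise bias; the function name `half_wave_rectified_gain`, the `eps` guard, and the sample values are illustrative assumptions.

```python
import numpy as np

def half_wave_rectified_gain(mag_x, mu, eps=1e-12):
    """H_R = (H + |H|) / 2 with H = 1 - mu / |X|."""
    H = 1.0 - mu / np.maximum(mag_x, eps)  # eps is an assumed guard
    return 0.5 * (H + np.abs(H))

mag_x = np.array([4.0, 1.0, 2.0])   # |X| at three bins (made-up values)
mu = np.array([1.0, 2.0, 2.0])      # noise bias at the same bins

H_R = half_wave_rectified_gain(mag_x, mu)
out_mag = H_R * mag_x               # magnitude of the rectified estimate
```

The output magnitude is the same as clipping |X| - mu at zero, which is how the operation is described in the implementation section.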

G. Residual Noise Reduction

After half-wave rectification, speech plus noise lying above $\mu$ remains.
In the absence of speech activity the difference $N_R = N - \mu e^{j\theta_n}$, which
shall be called the noise residual, will for uncorrelated noise
exhibit itself in the spectrum as randomly spaced narrow bands of magnitude
spikes (see Figure IV.4). This noise residual will have a magnitude between
zero and a maximum value measured during non-speech activity. Transformed
back to the time domain, the noise residual will sound like the sum of
tone generators with random fundamental frequencies which are turned on
and off at a rate of about 20 ms. During speech activity the noise
residual will also be perceived at those frequencies which are not masked
by the speech.

The audible effects of the noise residual can be reduced by taking
advantage of its frame to frame randomness. Specifically, at a given frequency
bin, since the noise residual will randomly fluctuate in amplitude at
each analysis frame, it can be suppressed by replacing its current value
with its minimum value chosen from the adjacent analysis frames. Taking
the minimum value is used only when the magnitude of $\hat{S}(e^{j\omega})$ is less
than the maximum noise residual calculated during non-speech activity.
The motivation behind this replacement scheme is threefold: first, if
the amplitude of $\hat{S}(e^{j\omega})$ lies below the maximum noise residual and it
varies radically from analysis frame to frame, then there is a high
probability that the spectrum at that frequency is due to noise; therefore,
suppress it by taking the minimum. Second, if $\hat{S}(e^{j\omega})$ lies below the
maximum but has a nearly constant value, there is a high probability
that the spectrum at that frequency is due to low energy speech; therefore,
taking the minimum will retain the information. Third, if $\hat{S}(e^{j\omega})$
is greater than the maximum, there is speech present at that frequency;
therefore, removing the bias is sufficient. The amount of noise reduction
using this replacement scheme was judged equivalent to that obtained
by averaging over three frames. However, with this approach high energy
frequency bins are not averaged together. The disadvantage to the scheme
is that more storage is required to save the maximum noise residuals
and the magnitude values for three adjacent frames.

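The residual noise reduction scheme is implemented as

$$|\hat{S}_i(e^{j\omega})| = |\hat{S}_i(e^{j\omega})|, \quad \text{for } |\hat{S}_i(e^{j\omega})| \ge \max |N_R(e^{j\omega})|$$

$$|\hat{S}_i(e^{j\omega})| = \min\{\,|\hat{S}_j(e^{j\omega})|,\ j = i-1,\ i,\ i+1\,\}, \quad \text{for } |\hat{S}_i(e^{j\omega})| < \max |N_R(e^{j\omega})|$$

where

$$\hat{S}_i(e^{j\omega}) = H_R(e^{j\omega})\, X_i(e^{j\omega})$$

and

$\max |N_R(e^{j\omega})|$ = maximum value of noise residual measured during
non-speech activity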
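
The replacement rule above can be sketched as follows. The input is assumed to be an array of rectified magnitudes with one row per analysis frame; the function name `reduce_residual` and the handling of the first and last frames (each reuses itself as the missing neighbour) are assumptions, since the report does not describe the edge frames.

```python
import numpy as np

def reduce_residual(mags, max_residual):
    """Three-frame minimum selection below the maximum noise residual.

    mags: (num_frames, num_bins) rectified magnitudes |S_i|.
    max_residual: (num_bins,) maximum |N_R| measured during non-speech.
    """
    mags = np.asarray(mags, dtype=float)
    # Edge frames reuse themselves as the missing neighbour (assumption).
    padded = np.vstack([mags[:1], mags, mags[-1:]])
    local_min = np.minimum(np.minimum(padded[:-2], padded[1:-1]), padded[2:])
    # Replace only where the current magnitude is below the residual bound.
    return np.where(mags < max_residual, local_min, mags)

mags = np.array([[0.5, 5.0],
                 [0.1, 6.0],
                 [0.8, 0.2]])
max_res = np.array([1.0, 1.0])
out = reduce_residual(mags, max_res)
```

In the example the large values in the second bin (5.0 and 6.0) pass through untouched, while every value below the bound is replaced by the smallest of its three-frame neighbourhood.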

H. Additional Signal Attenuation During Non-Speech Activity

The energy content of $\hat{S}(e^{j\omega})$ relative to $\mu(e^{j\omega})$ provides an accurate
indicator of the presence of speech activity within a given analysis frame.
If speech activity is absent then $\hat{S}(e^{j\omega})$ will consist of the noise residual
which remains after half-wave rectification and minimum value selection.
Empirically, it was determined that the average (before versus after) power
ratio was down at least 12 dB. This implied a measure for detecting
the absence of speech given by:

$$T = 20 \log_{10} \left[ \frac{1}{2\pi} \int_{-\pi}^{\pi} \left| \frac{\hat{S}(e^{j\omega})}{\mu(e^{j\omega})} \right| d\omega \right]$$

If T was less than -12 dB the frame was classified as having no speech
activity. During the absence of speech activity there are at least three
options prior to resynthesis: do nothing, attenuate the output by a
fixed factor, or set the output to zero. Having some signal present
during non-speech activity was judged to give the higher quality result.
A possible reason for this is that noise present during speech activity
is partially masked by the speech. Its perceived magnitude should be
balanced by the presence of the same amount of noise during non-speech
activity. Setting the buffer to zero had the effect of amplifying the
noise during speech activity. Likewise, doing nothing had the effect of
amplifying the noise during non-speech activity. A reasonable though
by no means optimum amount of attenuation was found to be -30 dB. Thus
the output spectral estimate including output attenuation during non-speech
activity is given by

III. Algorithm Implementation


A.  Introduction 

Based on the development of the last section, a complete
analysis-synthesis algorithm can be constructed. This section presents the
specifications required to implement a spectral subtraction noise suppression system.

B.  Input-Output  Data  Buffering  and  Windowing 

Speech from the A-D converter is segmented and windowed such that
in the absence of spectral modifications, if the synthesis speech segments
are added together, the resulting overall system reduces to an identity.
The data is segmented and windowed using the result [12] that if a
sequence is separated into half-overlapped data buffers, and each buffer
is multiplied by a Hanning window, then the sum of these windowed sequences
adds back up to the original sequence. The window length is chosen to
be approximately twice as large as the maximum expected pitch period
for adequate frequency resolution [13]. For the sampling rate of 8.00
kHz a window length of 256 points shifted in steps of 128 points was
used. Figure III.1 shows the data segmentation and advance.
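
The identity property can be checked numerically. The sketch below windows a random sequence with a 256 point Hanning window advanced in steps of 128 points and overlap-adds the pieces. It builds the periodic form of the window explicitly, because NumPy's `np.hanning` returns the symmetric form, which does not sum exactly to one at this hop; the choice of the periodic form and the random test signal are assumptions made for the illustration.

```python
import numpy as np

N, hop = 256, 128
n = np.arange(N)
window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / N)   # periodic Hanning window

rng = np.random.default_rng(1)
x = rng.standard_normal(hop * 10)                  # stand-in for sampled speech
y = np.zeros_like(x)

# Window each half-overlapped buffer and overlap-add it into the output.
for start in range(0, len(x) - N + 1, hop):
    y[start:start + N] += window * x[start:start + N]

# Samples covered by two windows; the first and last half-buffers are not.
interior = slice(hop, len(x) - hop)
```

Away from the first and last half-buffer, the two overlapping window values add to one, so the output equals the input when no spectral modification is made.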


C.  Frequency  Analysis 


The DFT of each data window is taken and the magnitude is computed.
Since real data is being transformed, two data windows can be transformed
using one FFT [14]. The FFT size is set equal to the window size of 256.
Augmentation with zeros was not incorporated. As correctly noted by
J. Allen [15], spectral modification followed by inverse transforming
can distort the time waveform due to temporal aliasing caused by circular
convolution with the time response of the modification. Augmenting the
input time waveform with zeros before spectral modification will minimize
this aliasing. Experiments with and without augmentation using the
helicopter speech resulted in negligible differences and therefore
augmentation was not incorporated. Finally, since real data is analyzed, transform
symmetries were taken advantage of to reduce storage requirements essentially
in half [14].
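
The two-for-one transform mentioned above can be illustrated with the standard conjugate-symmetry identity: pack one real window into the real part and another into the imaginary part of a complex sequence, take a single FFT, and separate the two spectra. This is a generic sketch of that identity, not necessarily the routine used in the report; the function name `fft_two_real` and the test data are assumptions.

```python
import numpy as np

def fft_two_real(x1, x2):
    """Return the DFTs of two real sequences using one complex FFT."""
    z = np.asarray(x1, dtype=float) + 1j * np.asarray(x2, dtype=float)
    Z = np.fft.fft(z)
    # conj(Z[(-k) mod N]) for every bin k
    Z_rev = np.conj(np.roll(Z[::-1], 1))
    X1 = 0.5 * (Z + Z_rev)       # even (conjugate-symmetric) part
    X2 = -0.5j * (Z - Z_rev)     # odd part divided by j
    return X1, X2

rng = np.random.default_rng(4)
a = rng.standard_normal(256)
b = rng.standard_normal(256)
A, B = fft_two_real(a, b)
```

Because each real window has a conjugate-symmetric spectrum, only about half of the bins need to be stored, which is the storage saving the text refers to.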

D. Magnitude Averaging

As was described in the previous section, the variance of the noise
spectral estimate is reduced by averaging over as many spectral magnitude
sets as possible. However, the nonstationarity of the speech limits
the total time interval available for local averaging. The number of
averages is limited by the number of analysis windows which can be fit
into the stationary speech time interval. The choice of window length
and averaging interval must compromise between conflicting requirements.
For acceptable spectral resolution a window length greater than twice
the expected largest pitch period is required, with a 256 point window
being used. For minimum noise variance a large number of windows is
required for averaging. Finally, for acceptable time resolution a narrow
analysis interval is required. A reasonable compromise between variance
reduction and time resolution appears to be three averages. This results
in an effective analysis time interval of 38 ms.

E.  Bias  Estimation 

The spectral subtraction method requires an estimate at each frequency
bin of the expected value of the noise magnitude spectrum, $\mu_N$:

$$\mu_N = E\{|N|\}$$

This estimate is obtained by averaging the signal magnitude spectrum
$|X|$ during non-speech activity. Estimating $\mu_N$ in this manner places
certain constraints when implementing the method. If the noise remains
stationary during the subsequent speech activity, then an initial startup
or calibration period of noise-only signal is required. During this period
(on the order of a third of a second) an estimate of $\mu_N$ can be computed.
If the noise environment is nonstationary then a new estimate of $\mu_N$
must be calculated prior to bias removal each time the noise spectrum
changes. Since the estimate is computed using the noise-only signal
during non-speech activity, a voice switch is required. When the voice
switch is off an average noise spectrum can be recomputed. If the noise
magnitude spectrum is changing faster than an estimate of it can be computed,
then time averaging to estimate $\mu_N$ cannot be used. Likewise if the
expected value of the noise spectrum changes after an estimate of it
has been computed, then noise reduction through bias removal will be less
effective or even harmful, i.e., removing speech where little noise is present.
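
A sketch of the calibration step, under stated assumptions, is given below. It frames roughly 300 ms of noise-only signal at 8 kHz with half-overlapped 256 point Hanning windows, averages the magnitude spectra to estimate the bias, and also records a per-bin maximum of the rectified residual for later use. The helper names `frame_signal` and `estimate_noise_bias`, the use of white noise as the calibration signal, and the way the maximum residual is computed are all assumptions for illustration.

```python
import numpy as np

def frame_signal(x, N=256, hop=128):
    """Split x into half-overlapped, Hanning-windowed frames."""
    n = np.arange(N)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / N)
    starts = range(0, len(x) - N + 1, hop)
    return np.array([window * x[s:s + N] for s in starts])

def estimate_noise_bias(noise_only, N=256, hop=128):
    """Return (mu, mags): mean magnitude per bin and per-frame magnitudes."""
    frames = frame_signal(noise_only, N, hop)
    mags = np.abs(np.fft.rfft(frames, axis=1))   # N // 2 + 1 bins per frame
    return mags.mean(axis=0), mags

rng = np.random.default_rng(2)
noise = rng.standard_normal(2400)                # 300 ms at 8 kHz (synthetic)
mu, noise_mags = estimate_noise_bias(noise)

# Largest rectified residual per bin seen during calibration (assumption).
max_residual = np.maximum(noise_mags - mu, 0.0).max(axis=0)
```

If a voice switch reports a new stretch of non-speech activity, the same two calls can be repeated on that stretch to refresh the bias.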


F.  Bias  Removal  and  Half-Wave  Rectification 

The spectral subtraction spectral estimate $\hat{S}$ is obtained by subtracting
the expected noise magnitude spectrum $\mu$ from the magnitude signal
spectrum $|X|$. Thus:

$$|\hat{S}(k)| = |X(k)| - \mu(k), \quad k = 0, 1, \ldots, L-1$$

or

$$\hat{S}(k) = H(k)\, X(k), \quad H(k) = 1 - \frac{\mu(k)}{|X(k)|}, \quad k = 0, 1, \ldots, L-1$$

where L = DFT buffer length.

After subtracting, the differenced values having negative magnitudes
are set to zero (half-wave rectification). These negative differences
represent frequencies where the sum of speech plus local noise is less
than the expected noise.

G.  Residual  Noise  Reduction 

As discussed in the previous section, the noise that remains after
the mean is removed can be suppressed or even removed by selecting the
minimum magnitude value from the three adjacent analysis frames in each
frequency bin where the current amplitude is less than the maximum noise
residual measured during non-speech activity. This replacement procedure
follows bias removal and half-wave rectification. Since the minimum
is chosen from values on each side of the current time frame, the
modification induces a one frame delay. The improvement in performance was
judged superior to three frame averaging in that an equivalent amount
of noise suppression resulted without the adverse effect of high-energy
spectral smoothing. The following section presents examples of spectra
with and without residual noise reduction.

H.  Additional  Noise  Suppression  During  Non-Speech  Activity 

The  final  improvement  in  noise  reduction  is  signal  suppression  during 
non-speech  activity.  As  was  discussed,  a balance  must  be  maintained 
between  the  magnitude  and  characteristics  of  the  noise  that  is  perceived 
during  speech  activity  and  the  noise  that  is  perceived  during  speech 
absence. 

An effective speech activity detector was defined using spectra
generated by the spectral subtraction algorithm. This detector required
the determination of a threshold signaling absence of speech activity.
This threshold (T = -12 dB) was empirically determined to ensure that
only signals definitely consisting of background noise would be attenuated.
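
The detector and the attenuation step can be sketched as follows. The integral in the measure T is approximated by a mean over DFT bins. When a frame falls below the -12 dB threshold, the sketch scales the noisy input spectrum by -30 dB; which spectrum is scaled is an assumption here, as are the function names `speech_activity_db` and `attenuate_if_silent` and the `eps` guard.

```python
import numpy as np

def speech_activity_db(mag_s_hat, mu, eps=1e-12):
    """T = 20 log10 of the bin-averaged ratio |S_hat| / mu."""
    ratio = np.mean(mag_s_hat / np.maximum(mu, eps))
    return 20.0 * np.log10(np.maximum(ratio, eps))

def attenuate_if_silent(S_hat, X, mu, threshold_db=-12.0, atten_db=-30.0):
    """Return (spectrum, speech_present) for one analysis frame."""
    T = speech_activity_db(np.abs(S_hat), mu)
    if T < threshold_db:
        c = 10.0 ** (atten_db / 20.0)
        return c * X, False     # scaling X rather than S_hat is an assumption
    return S_hat, True

mu = np.ones(4)
X = np.full(4, 1.1 + 0j)
quiet, quiet_active = attenuate_if_silent(np.full(4, 0.1 + 0j), X, mu)
loud, loud_active = attenuate_if_silent(np.full(4, 5.0 + 0j), X, mu)
```

Keeping a small amount of signal in non-speech frames, instead of zeroing them, follows the balance between speech and non-speech noise levels discussed in this section.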

I.  Synthesis 

After bias removal, rectification, residual noise removal, and
non-speech signal suppression, a time waveform is reconstructed from
the modified magnitude corresponding to the center window. Again, since
only real data is generated, two time windows are computed simultaneously
using one inverse FFT. The data windows are then overlap added to form
the output speech sequence. The overall system block diagram is given
in Figure III.2.
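
The analysis-synthesis chain can be put together in a compact sketch. The version below covers only windowing, bias removal with half-wave rectification, reuse of the noisy phase, and overlap-add synthesis; residual noise reduction and non-speech attenuation are left out to keep it short. The function name `spectral_subtraction`, the use of `rfft`/`irfft` in place of the paired-window FFT, and the synthetic input are assumptions.

```python
import numpy as np

def spectral_subtraction(x, mu, N=256, hop=128):
    """Bias removal with half-wave rectification and overlap-add synthesis.

    x: real input sequence; mu: noise bias with N // 2 + 1 bins.
    """
    n = np.arange(N)
    window = 0.5 - 0.5 * np.cos(2.0 * np.pi * n / N)    # periodic Hanning
    y = np.zeros(len(x))
    for start in range(0, len(x) - N + 1, hop):
        X = np.fft.rfft(window * x[start:start + N])
        mag = np.maximum(np.abs(X) - mu, 0.0)            # subtract and rectify
        S_hat = mag * np.exp(1j * np.angle(X))           # keep the noisy phase
        y[start:start + N] += np.fft.irfft(S_hat, n=N)   # overlap add
    return y

rng = np.random.default_rng(3)
x = rng.standard_normal(2048)                            # synthetic input
y_identity = spectral_subtraction(x, mu=np.zeros(129))   # no modification
y_suppressed = spectral_subtraction(x, mu=np.full(129, 5.0))
```

With a zero bias the system reduces to an identity away from the ends of the sequence, and with a positive bias the output energy drops.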

IV. Results


A.  Introduction 

Examples of the performance of spectral subtraction will be presented
in two forms: isometric plots of time versus frequency magnitude spectra,
with and without noise cancellation; and intelligibility and quality
measurements obtained from the Diagnostic Rhyme Test (DRT) [11]. The
DRT is a well established method for evaluating speech processing devices.
Testing and scoring of the DRT data base was provided by Dynastat Inc.
[12]. A limited single speaker DRT test was used. The DRT data base
consisted of 192 words using speaker RH recorded in a helicopter
environment. A crew of 8 listeners was used.

The results are presented as follows: (1) short time amplitude
spectra of helicopter speech; (2) DRT intelligibility and quality scores
on helicopter speech with and without spectral subtraction; (3) DRT
intelligibility and quality scores on LPC vocoded speech using as input
the data given in (2); and (4) short time spectra showing additional
improvements in noise rejection through residual noise suppression and
non-speech signal attenuation.

B.  Short  Time  Spectra  of  Helicopter  Speech 

Isometric  plots  of  time  versus  frequency  magnitude  spectra  were 
constructed  from  the  data  by  computing  and  displaying  magnitude  spectra 
from  sixty-four  overlapped  Hanning  windows.  Each  line  represents  a 
128  point  frequency  analysis.  Time  increases  from  bottom  to  top  and 
frequency  from  left  to  right. 

A 920 ms section of speech recorded with a noise cancelling microphone
in a helicopter environment is presented. The phrase "Save your" was
filtered at 3.2 kHz and sampled at 6.67 kHz. Since the noise was
acoustically added, no underlying clean speech signal is available.

Figure IV.1 shows the digitized time signal. Figure IV.2 shows the
average noise magnitude spectrum computed by averaging over the first
300 ms of non-speech activity. The short time spectrum of the noisy
signal x is shown in Figure IV.3. Note the high amplitude, narrow
band ridges corresponding to the fundamental (1550 Hz) and first harmonic
(3100 Hz) of the helicopter engine, as well as the ramped noise floor
above 1800 Hz. Figure IV.4 shows the result from bias removal and
rectification. Figures IV.5 and IV.6 show the noisy spectrum and the
spectral subtraction estimate using three frame averaging.
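
The processing behind these figures can be sketched as below. This is a minimal sketch operating on magnitude spectra given as plain lists; the function names are hypothetical, and the three frame average is assumed here to be a simple mean of adjacent magnitude frames.

```python
def average_noise_magnitude(noise_frames):
    """Average magnitude per frequency bin over non-speech frames."""
    count = len(noise_frames)
    return [sum(column) / count for column in zip(*noise_frames)]

def remove_bias(x_mag, noise_mean):
    """Subtract the noise bias, then half-wave rectify (clip at zero)."""
    return [max(x - mu, 0.0) for x, mu in zip(x_mag, noise_mean)]

def three_frame_average(frames, i):
    """Mean of frame i and its immediate neighbours (fewer at the edges)."""
    lo, hi = max(i - 1, 0), min(i + 1, len(frames) - 1)
    group = frames[lo:hi + 1]
    return [sum(column) / len(group) for column in zip(*group)]
```

For the helicopter example, `noise_frames` would hold the magnitude spectra from the first 300 ms of non-speech activity, and `remove_bias` would be applied to each later frame, optionally after `three_frame_average`.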

These  figures  indicate  that  considerable  noise  rejection  has  been 
achieved  although  some  noise  residual  remains.  The  next  step  was  to 
quantitatively  measure  the  effect  of  spectral  subtraction  on  intelligibility 
and  quality.  For  this  task  a limited  single  speaker  DRT  was  invoked  to 
establish  an  anchor  point  for  credibility. 

C.  Intelligibility  and  Quality  Results  using  the  DRT 

The DRT data base consisted of 192 words recorded in a helicopter
environment. The data base was filtered at 4 kHz and sampled at 8 kHz.
During the pause between each word, the noise bias was updated. Six
output speech files were generated: (1) digitized original; (2) speech
resulting from bias removal and rectification without averaging; (3)
speech resulting from bias removal and rectification using three averages;
(4) an LPC vocoded version of original speech; (5) an LPC vocoded version
of (2); and (6) an LPC vocoded version of (3). The last three experiments
were conducted to measure intelligibility and quality improvements resulting
from the use of spectral subtraction as a preprocessor to an LPC
analysis-synthesis device. The LPC vocoder used was a non-real time floating
point implementation [17]. A 10 pole autocorrelation implementation
was used with a SIFT pitch tracker [18]. The channel parameters used
for synthesis were not quantized. Thus any degradation would not be
attributed to parameter quantization but rather to the all-pole
approximation to the spectrum and to the buzz-hiss approximation to the
error signal. In addition, a frame rate of 40 frames/sec was used, which is
typical of 2400 bps implementations. The vocoder on 3.2 kHz filtered
clean speech achieved a DRT score of 88.

In addition to intelligibility, a coarse measure of quality [19]
was conducted using the same DRT data base. These quality scores are
neither quantitatively nor qualitatively equivalent to the more rigorous
quality tests such as PARM or DAM [20]. However, they do indicate on
a relative scale improvements between data sets. Modern 2.4 kbps systems
are expected to range from 45 to 50 on composite acceptability; unprocessed
speech, 88-92.

The results of the tests are summarized in Tables IV.1 through
IV.4. Tables IV.1 and IV.2 indicate that spectral subtraction alone
does not decrease intelligibility but does increase quality, especially
in the areas of increased pleasantness and inconspicuousness of noise
background. Tables IV.3 and IV.4 clearly indicate spectral subtraction
can be used to improve the intelligibility and quality of speech processed
through an LPC bandwidth compression device.


D.  Short  Time  Spectra  Using  Residual  Noise  Reduction  and  Non-Speech 
Signal  Attenuation 

Based on the promising results of these preliminary DRT experiments
the algorithm was modified to incorporate residual noise reduction and
non-speech signal attenuation. Figure IV.7 shows the short time spectra
using the helicopter speech data with both modifications added. Note
that now noise between words has been reduced below the resolution of the
graph and noise within the words significantly attenuated (compare with
Figure IV.4).
V.  Summary and Conclusions

A preprocessing noise suppression algorithm using spectral subtraction
has been developed, implemented, and tested. Spectral estimates for the
background noise were obtained from the input signal during non-speech
activity. The algorithm can be implemented using a single microphone
source and requires about the same computation as a high-speed convolution.
Its performance was demonstrated using short-time spectra with and without
noise suppression, and quantitatively tested for improvements in
intelligibility and quality using the Diagnostic Rhyme Test conducted
by Dynastat Inc.

Results indicate overall significant improvements in quality and
intelligibility when used as a preprocessor to an LPC speech
analysis-synthesis vocoder.
Table IV.1

Diagnostic Rhyme Test Scores

               Original    Ŝ (No Average)    Ŝ (Three Averages)

Voicing           95             92                 91
Nasality          82             78                 77
Sustention        92             87                 86
Sibilation        75             83                 84
Graveness         68             70                 66
Compactness       88             87                 88
Total             84             83                 82
Table IV.2

Quality Ratings

                                   Original    Ŝ (No Average)    Ŝ (Three Averages)

Naturalness of Signal                 63             60                 61
Inconspicuousness of Background       36             38                 42
Intelligibility                       30             32                 33
Pleasantness                          20             31                 25
Overall Acceptability                 27             33                 29
Composite Acceptability               26             32                 29


Table IV.3

Diagnostic Rhyme Test Scores

               LPC on      LPC on Ŝ              LPC on Ŝ
               Original    without averaging     with averaging

Voicing           84             90                   86
Nasality          56             63                   52
Sustention        49             52                   56
Sibilation        61             70                   88
Graveness         61             62                   59
Compactness       83             83                   93
Total             66             70                   72


Table IV.4

Quality Ratings

                                   LPC on      LPC on Ŝ              LPC on Ŝ
                                   Original    without averaging     with averaging

Naturalness of Signal                 53             49                   58
Inconspicuousness of Background       34             36                   39
Intelligibility                       28             30                   28
Pleasantness                          15             28                   20
Overall Acceptability                 24             28                   26
Composite Acceptability               23             29                   25



Figure IV.1 Time Waveform of Helicopter Speech, "Save your".


Figure IV.6 Short Time Spectrum using Bias Removal and Half-wave
Rectification after Three Frame Averaging.


Figure IV.7 Short Time Spectrum using Bias Removal, Half-wave
Rectification, Residual Noise Reduction, and Non-speech
Signal Attenuation (Helicopter speech).
ABSTRACT

This paper describes a theoretical and experimental investigation
for detecting the presence of speech in wide-band noise. A robust
algorithm for making the voiced-unvoiced-silence decision is described.
This algorithm is based on a nonparametric statistical signal-detection
scheme that does not require a training set of data and maintains a
constant false alarm rate for a broad class of noise inputs. Two
nonparametric decision procedures are investigated, the Kruskal-Wallis and
the multiple use of the two-sample Savage statistic. The performances
of these detectors are evaluated and compared to that obtained from
manually classifying twenty recorded utterances. In limited testing,
the average probability of misclassification of voiced speech for the
Savage case was less than 6, 13, 28, and 55 percent, corresponding to
signal-to-noise ratios of 30, 20, 10, and 0 dB, respectively.
I.  INTRODUCTION

The problem of classifying speech in noise as voiced, unvoiced,
or silence (noise alone) is one of the most fundamental, important, and
difficult problems encountered in speech processing [1, 2, 3, 4]. The
voiced, unvoiced, or silence decision is required in most computer-
oriented speech communications, understanding, or recognition systems.
Various approaches for making this decision have been reported in the
speech literature. In most of these papers, the detection of speech
in background noise was conducted in a relatively noise-free environment
under ideal laboratory acoustic recording conditions. However, such
ideal acoustic environments are not realizable for practical usage of
speech processing systems.

Practical application of speech processing systems requires
the development of robust speech algorithms so that speech quality
does not degrade to an unacceptable level in the presence of acoustically
coupled background and channel noise, including telephone and radio
communication applications with speaker variations and nonstationary
aspects, tandeming and conferencing configurations, and in the presence
of communications jamming [2, 5].

The voiced-unvoiced-silence decision is a difficult problem in
these real environments. This paper reports the investigation of a
nonparametric, rank-order statistical decision procedure that shows
promise. It is theoretically robust in the communication sense,
maintaining a constant false alarm rate (type I error) independent of noise
power for a large class of distributions. Although this detection
approach is new to speech processing, it is a mature statistical
discipline. The nonparametric detection review paper by Thomas [6]
indicates that a bibliography published in 1962 gives more than 3000
references. The application and analysis of nonparametric detection
historically have been confined to nonengineering problems; an
engineering text has only recently been published [7]. Nonparametric
decision procedures have been recently applied to radar systems that must
operate in an environment of intense external interference [7].

The principal feature of nonparametric detection for this
engineering application is its ability to maintain a constant false-alarm
rate for large classes of noise distributions (equipment noise, weather,
clutter, interference). Some specific advantages applied to the speech
voiced-unvoiced-silence detection are:

1.  It  maintains  a constant  false-alarm  rate  with  a fixed 
threshold  for  large  classes  of  noise  distributions. 

2.  It  is  robust  (insensitive  to  changes  not  under  test)  and 
powerful  (sensitive  to  specific  factors  under  test)  in  a 
statistical  sense. 

3.  It  does  not  require  statistical  information  about  either 
the  signal  or  the  background  noise  (does  not  require  a 
training  set  of  data)  to  set  a decision  threshold. 

4.  Performance for signals in non-Gaussian noise may often
surpass that of detection optimized against Gaussian noise.

5.  It will operate where the noise statistics are nonstationary
or change from one application to another.

6.  It is simple to implement digitally.

7.  For large sample sizes, it can be as efficient as the Neyman-
Pearson detection for a wide class of noise distributions.

The  technique  developed  in  this  paper  is  designed  to  discriminate 
against  wide-band  noise,  but  is  expected  to  do  poorly  against  narrow- 
band  noise.  However,  with  some  reasonable  modifications,  the  narrow- 
band  noise  problem  could  be  moderated. 

Although the voiced-unvoiced-silence decision has wide speech
system application, a considerable part of this research was motivated
by the requirements of digital communications systems. The past
several years have seen notable advances in linear predictive coding
(LPC) vocoder research, development, and implementation, including
hardware and software realization. This effort to develop and implement
an all-digital communications system has resulted in hardware
implementation of the LPC vocoder algorithm. The LPC algorithm was
designed in a relatively noise-free environment; its quality and
performance degrade in the presence of background noise. Practical usage
of the LPC vocoder in acoustically adverse environments has identified
a need for more robust speech-processing algorithms. The principal
objective of this research was to address the robust speech detection
issue in the presence of wide-band noise.
II.  BACKGROUND

The problem of detecting voice signals in the presence of noise
has only been addressed by a small number of investigations. In these
investigations, the traditional approach to distinguish between voice
and noise was to level detect waveform energy [1, 8, 9]. The threshold
normally was experimentally determined by a limited training set of
data [9, 10], by the maximum noise power recommended by CCITT for
telephone channels [4, 9, 11], or by a threshold adjustment process
updated on a fixed schedule (every half second) [12].

Recently, Atal and Rabiner [13] suggested a pattern recognition
approach to voiced-unvoiced-silence classification in which five
measurements or features (energy, zero-crossing rate, autocorrelation
coefficient at unit sample delay, first predictor coefficient, and energy
of the predictor errors) were combined using a non-Euclidean distance
metric to give a reliable decision. This method was optimized for telephone
line inputs by Rabiner, et al. [14], and used for digit recognition by
Rabiner, et al. [15, 16]. The algorithm was modified to do an average
signal spectrum template match using an LPC distance measure [17].

Siegel and Steiglitz [18] proposed a modification to the Atal
[13] algorithm in which a relatively small set of samples was used to
train the classifier using three features: LPC normalized minimum
error, RMS value, and ratio of high-to-low frequency energy.

Lin [19] and Adoul [20, 21] modified Atal and Rabiner's pattern
recognition approach for their proposed detectors.
Sarma and Venugopal [22] suggested a classification technique
requiring less computational effort based on the concept of variable
decision space, using only three features and by avoiding linear
predictive analysis.

The pattern recognition approach to the voiced-unvoiced-silence
classification has usefulness for many speech processing systems
applications. However, it does not address the robustness issue in a
communications sense since the scheme requires a training set of data
and will operate without degradation in performance only for that
particular recording condition. The nonstationary speaking
environment limitation mentioned by Atal and Rabiner still exists [13].

An optimum classification detector was suggested by McAuley [23],
in which a matched digital Wiener filter was designed for each signal
class and the signal was processed in parallel by each of these filters. A
statistical maximum likelihood decision criterion was used to make the
final classification. Rabiner [15] indicated that this approach shows
promise, but that it requires a large amount of signal processing, and
has not as yet been extensively tested.

McAuley  [24]  modified  his  method  to  include  an  adaptive  noise 
cancellation  algorithm.  The  training  requirement  for  this  algorithm, 
though  not  as  stringent  as  the  Atal-Rabiner  algorithm,  requires  a 300 
ms  speech-free  interval  to  determine  noise  detection  thresholds. 
Jankowski  [12]  developed  an  adaptive  threshold  method  that  operated  on 
a fixed  schedule  every  half  second  to  train  the  detector. 
III.  RATIONALE

The following rationale presents a nonparametric approach to
speech detection which requires no training sets or adaptive techniques.
A nonparametric rank-ordered statistical detection technique is used
to classify a sequence of small intervals of data as voiced, unvoiced,
or silence. The strategy of nonparametric detection used in this
paper is to compare the rank-order of samples from two or more
experiments. The primary problems are to select an efficient statistic and
test procedure which are sensitive to voiced-unvoiced-silence
parameters but are insensitive to other variables such as signal-to-noise
ratio. Theoretical discussions of the following issues are presented
in Woinsky [25].

First consider the traditional hypothesis test involving samples
from two experiments; more than two samples are considered later.
The sets $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$
denote the samples obtained in each experiment where the elements $x_i$
and $y_j$ represent amplitude values of random, independent samples of
size m and n, respectively. The sets X and Y are assumed to be from
populations with unknown continuous cumulative distribution functions
$F_x$ and $F_y$, respectively. The detection problem is to make the
decision $F_x = F_y$ or $F_x \ne F_y$. The statement $H_0: F_x = F_y$ is
the null hypothesis. The alternate hypothesis is $H_1: F_x \ne F_y$.

The null hypothesis $H_0: F_x = F_y$ can be tested without any knowledge
of $F_x$ and $F_y$ using nonparametric rank-ordered statistical methods as
follows. Since it is assumed that $F_x = F_y$, all data from X and Y are
pooled to form the set $Z = X + Y = \{z_1, z_2, \ldots, z_{m+n}\}$. The
elements in Z are assigned ranks $r(z_i)$ according to relative values
(larger or smaller) and reordered according to rank such that

$$R(Z) = \{r(z_1), r(z_2), \ldots, r(z_N)\} = \{1, 2, \ldots, m+n\}$$

where $N = n + m$. The basic assumption of rank-ordered statistics is
that any element in X or Y is equally likely to appear as any given
rank in R(Z). Let the elements in R(Z) belonging to X be $r(x_i)$. The
probability of occurrence of any specified rank-ordered subset X is
equally likely with the probability of occurrence $1/\binom{N}{m}$ where the
binomial coefficient is all possible arrangements (combinations) of
the subset X in Z. All probabilities of rank-ordered statistics can
be determined by counting possible outcomes and, consequently, all
probability calculations are independent of amplitude information
(signal-to-noise ratio).

The hypothesis test is completed by selecting a test statistic
$T$ and a decision threshold $T_\alpha$; i.e., if $P(T > T_\alpha) < \alpha$,
then $H_0: F_x = F_y$ is rejected. For the purposes of this paper, a single
tail decision is made using a threshold $T_\alpha$ corresponding to the
probability $\alpha$ of rejecting $H_0$ when $H_0$ is true (a type I error).

Two nonoptimal test procedures are considered which deal with
experiments involving multiple samples, the Kruskal-Wallis and
simultaneous [25, 27, 28, 29, 30]. Two basic test statistics are introduced,
the Mann-Whitney-Wilcoxon [7] and the Savage [25, 31, 32], which are modified
for use in the multiple test procedures. The modifications involve a
chi-squared and mixed statistic [33]. The Mann-Whitney [7] and Savage
[31] tests, which are two sample tests, are discussed first to
introduce basic concepts of the Mann-Whitney-Wilcoxon and Savage
nonparametric statistics before the multiple sample tests are considered.

The Mann-Whitney-Wilcoxon Statistic and Mann-Whitney Test

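The Mann-Whitney-Wilcoxon statistic S is simply the sum of the
ranks of the elements belonging to X; i.e.,

$$S = \sum_{i=1}^{m} r(x_i) \qquad (1)$$

which can be modified such that

[Eq. (2), defining the $T_{MW}$ statistic, and the opening of the
worked example are missing from the source.]

$$X = \{16, 8, 32\}$$

$$Y = \{10, 3, 5, 14\}$$

with a decision threshold $P(S > S_\alpha) = \alpha = 0.05$. We find that

$$Z = \{3, 5, 8, 10, 14, 16, 32\}$$

and the rank sequence is

$$R(Z) = \{r(y_2), r(y_3), r(x_2), r(y_1), r(y_4), r(x_1), r(x_3)\} = \{1, 2, 3, 4, 5, 6, 7\}$$

The S statistic for this case is

$$S = 3 + 6 + 7 = 16$$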


As a matter of counting we note that the largest possible value of S
could have been 18 which could have occurred once, S = 17 could have
occurred once, and S = 16 could have occurred twice (S = 3 + 6 + 7 and
S = 4 + 5 + 7), etc. The total number of possible outcomes is
$\binom{7}{3} = 7!/(3!)(4!) = 35$. Consequently, the corresponding
probabilities of the upper tail are

$$P(S = 18) = 1/35$$

$$P(S = 17) = 1/35$$

$$P(S = 16) = 2/35$$

which gives $P(S \ge 16) = 4/35 = 0.114 > \alpha = 0.05$. Consequently,
$H_0: F_x = F_y$ is accepted. The hypothesis would have been rejected if
S = 18.
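
The counting argument in this example can be checked with a short sketch. It enumerates every way of choosing m ranks out of m + n, which is practical only for small samples; the function names are hypothetical and tied values are assumed not to occur.

```python
from fractions import Fraction
from itertools import combinations

def rank_sum(x, y):
    """Sum of the pooled ranks of the elements of x (assumes no ties)."""
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    return sum(rank[value] for value in x)

def upper_tail_probability(s_obs, m, n):
    """Exact P(S >= s_obs) under the null hypothesis, by enumeration."""
    total = 0
    hits = 0
    for ranks in combinations(range(1, m + n + 1), m):
        total += 1
        if sum(ranks) >= s_obs:
            hits += 1
    return Fraction(hits, total)

s = rank_sum([16, 8, 32], [10, 3, 5, 14])
p = upper_tail_probability(s, 3, 4)
print(s, p)  # 16 4/35
```

Because only ranks enter the calculation, rescaling all of the amplitudes by a common factor leaves both S and its tail probability unchanged.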

For large values of m, the central limit theorem applies and
the $T_{MW}$ statistic approaches normality. Tables for the $T_{MW}$
statistic can be found for n and m ranging up to 20 [7]. For larger values,
normal distribution tables can be used. The Mann-Whitney test remains
unbiased and consistent if $F_x$ and $F_y$ differ only in location of their
means [7]. Consequently, the Mann-Whitney test is used primarily to
test the difference in mean values; i.e., $H_0: E[X] = E[Y]$ or
$H_1: E[X] \ne E[Y]$. Other tests such as the Savage are more sensitive to
differences in variance.

The  Savage  Statistic  and  Test 

The Savage statistic is the optimal nonparametric rank-ordered
statistic for random variables exponentially distributed in amplitude
considering the hypothesis $H_0: \sigma_x = \sigma_y$ [31], where $\sigma_x$
and $\sigma_y$ represent the standard deviations of X and Y. To a good
approximation voiced speech is exponentially distributed. Figure 1 presents
an amplitude probability density function experimentally determined from
speech [34] which is composed of two components, voiced and unvoiced. The
unvoiced accounts for the high peak near zero which tends to be
normally distributed, whereas the diffuse tails near $\pm 2\sigma$, unlike
the normal density function, are caused by voiced speech. Two exponential
density functions, Gamma and Laplace, are superimposed in Fig. 1 which better
represent voiced speech in the neighborhood of $\pm 2\sigma$.

Since the voiced-unvoiced-silence decision thresholds are usually
around the $2\sigma$ diffuse tail, better decisions can be made if voiced
speech is modeled as being exponentially distributed. In nonparametric
decision theory, the optimal Savage statistic for exponentially
distributed speech is [32]


$$T_S = \sum_{k=1}^{N} a_k U_k \qquad (5)$$

where

$$U_k = \begin{cases} 1 & \text{if } z_k \in X \\ 0 & \text{if } z_k \in Y \end{cases} \qquad (6)$$

$$a_k = \sum_{j=N-k+1}^{N} \frac{1}{j} \qquad (7)$$

$$N = m + n$$

The term $a_k$ weights the rank elements in Z belonging to X with
increasing value as $k \to N$. Consequently, the $T_S$ statistic gives more
emphasis to the statistical data near the decision thresholds than the
$T_{MW}$ statistic. The mean and variance of the Savage statistic are

$$E[T_S] = m \qquad (8)$$

$$\mathrm{Var}[T_S] = \frac{mn}{N-1}\left[1 - \frac{1}{N}\sum_{j=1}^{N}\frac{1}{j}\right] \qquad (9)$$


Associated probabilities for decision purposes can be found in [32],
Table 10, for n and m less than 20. For larger values, $T_S$ approaches
normality. Consequently the normal distribution can be used in
conjunction with Eqs. 8 and 9 to establish the decision threshold $T_\alpha$.
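
The Savage statistic and its null moments can be sketched as follows, using exact rational arithmetic so that the mean and variance can be compared against a brute-force enumeration for small m and n. The weight symbol, function names, and the no-ties assumption are choices made for this illustration, not taken from the source.

```python
from fractions import Fraction

def savage_scores(N):
    """Weights a_k = sum_{j=N-k+1}^{N} 1/j for ranks k = 1..N."""
    return [sum(Fraction(1, j) for j in range(N - k + 1, N + 1))
            for k in range(1, N + 1)]

def savage_statistic(x, y):
    """Sum of the Savage weights at the pooled ranks of x (assumes no ties)."""
    a = savage_scores(len(x) + len(y))
    pooled = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    return sum(a[rank[value] - 1] for value in x)

def null_mean_and_variance(m, n):
    """Mean m and variance mn/(N-1) * [1 - (1/N) sum_{j=1}^{N} 1/j]."""
    N = m + n
    harmonic = sum(Fraction(1, j) for j in range(1, N + 1))
    return Fraction(m), Fraction(m * n, N - 1) * (1 - harmonic / N)
```

The largest rank receives the largest weight, so a sample whose values crowd the top of the pooled ordering (as expected for a band with larger variance, if magnitudes are ranked) produces a value of the statistic well above its null mean m.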

Kruskal-Wallis  Multiple  Decision  Procedure 

The voiced-unvoiced-silence decision as described in the following
section involves independent samples from four frequency bands.
The Kruskal-Wallis test is considered since it was specifically designed
to test the multiple sample problem.

In general consider K samples

$$X_1 = \{x_{11}, x_{12}, \ldots, x_{1 n_1}\}$$

$$X_2 = \{x_{21}, x_{22}, \ldots, x_{2 n_2}\}$$

$$\vdots$$

$$X_K = \{x_{K1}, x_{K2}, \ldots, x_{K n_K}\}$$

with the total number of observations $N = \sum_{i=1}^{K} n_i$ which are
pooled and assigned ranks $r(x_{ij})$. The samples are assumed to be
distributed $F_1, F_2, \ldots, F_K$ and K multiple decisions are made based
upon the null hypotheses
$H_0: (F_1 = \cdots = F_i = F_{i+1} = \cdots = F_K)$. The multiple
sample problem differs from the two sample problems since two or more
distributions may not be equal to the remaining. Consequently the
pooled sample may be biased (upward in the case of speech). Reference
[25] indicates that no optimal test statistics have been found.
However, a decision procedure can be formulated using the statistic


$$T_{KW} = \sum_{i=1}^{K} \frac{(N - n_i)}{N} \, \frac{(T_{Si} - n_i)^2}{\mathrm{Var}[T_{Si}]} \qquad (10)$$

which is asymptotically chi-squared distributed with K - 1 degrees of
freedom and, consequently, allows use of existing probability tables
to set $T_\alpha$. The $(N - n_i)/N$ term asymptotically removes the bias from
the pooled sample. The $T_{Si}$ term is the Savage statistic for the ith
sample with

$$E[T_{Si}] = n_i \qquad (11)$$

and

$$\mathrm{Var}[T_{Si}] = \frac{n_i (N - n_i)}{N-1}\left[1 - \frac{1}{N}\sum_{j=1}^{N}\frac{1}{j}\right] \qquad (12)$$

The Savage test statistic $T_{Si}$ was selected since it is sensitive to
voiced speech and a variance alternative.
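
A minimal sketch of the multiple sample statistic of Eq. 10 follows. It assumes the variance of each per-sample Savage statistic takes the two-sample form of Eq. 9 with m replaced by n_i and n by N - n_i; the function names are hypothetical, and ties are broken arbitrarily by sample index.

```python
def savage_scores(N):
    """Weights a_k = sum_{j=N-k+1}^{N} 1/j for ranks k = 1..N."""
    return [sum(1.0 / j for j in range(N - k + 1, N + 1))
            for k in range(1, N + 1)]

def kruskal_wallis_savage(samples):
    """Return (T_KW, per-sample Savage statistics) for a list of samples."""
    pooled = sorted((value, i) for i, sample in enumerate(samples)
                    for value in sample)
    N = len(pooled)
    a = savage_scores(N)
    t = [0.0] * len(samples)
    for rank0, (_, i) in enumerate(pooled):
        t[i] += a[rank0]          # accumulate the weight at each pooled rank
    harmonic = sum(1.0 / j for j in range(1, N + 1))
    stat = 0.0
    for i, sample in enumerate(samples):
        n_i = len(sample)
        var_i = n_i * (N - n_i) / (N - 1) * (1.0 - harmonic / N)
        stat += (N - n_i) / N * (t[i] - n_i) ** 2 / var_i
    return stat, t
```

For the four-band system described later, the returned statistic would be compared with a chi-squared critical value for 3 degrees of freedom (about 7.81 at the 5 percent level).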
Simultaneous Decision Procedure

The Mann-Whitney-Wilcoxon and Savage test statistics are biased
when applied to the multiple sample case as discussed in the previous
paragraph. For small $\alpha \ll 1$ the correction factor

$$\alpha' = \frac{2\alpha}{K(K - 1)} \qquad (13)$$

may be applied to remove the bias [29, p. 179]. Tests using this
correction factor are referred to as a "Simultaneous Decision Procedure".
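
As a worked illustration of Eq. 13, take the four-band case K = 4 with an overall false-alarm rate of 0.05 (values used elsewhere in this paper). Reading the factor K(K - 1)/2 as the number of pairwise comparisons among the K samples is an interpretation assumed here, not stated in the text.

```latex
\alpha' = \frac{2\alpha}{K(K-1)} = \frac{2(0.05)}{4 \cdot 3} = \frac{0.1}{12} \approx 0.0083,
\qquad \frac{K(K-1)}{2} = 6 \ \text{pairwise comparisons.}
```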

Mixed  Statistics 

Feustal [33] demonstrated that on the order of $N^2$ operations are
required to perform the ranking operation. Feustal proposed a mixed
statistical test that requires on the order of pN operations for the
case where $n = n_i = n_j$. The n observations from each of the K samples
are divided into p groups of q observations. The amplitude values of
each group are summed forming pK values which are then ranked and
incorporated into any of the above rank-ordered tests. Feustal
demonstrated that negligible loss in efficiency is experienced for
q > 15.
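
The grouping step of the mixed statistic can be sketched as below. This is an illustrative sketch with hypothetical function names; whether signed amplitudes or magnitudes are summed is left to the caller, and any observations beyond p complete groups are simply dropped.

```python
def group_sums(values, q):
    """Sum consecutive groups of q observations (incomplete tail dropped)."""
    p = len(values) // q
    return [sum(values[g * q:(g + 1) * q]) for g in range(p)]

def mixed_ranks(samples, q):
    """Rank the pooled group sums; return the ranks held by each sample."""
    sums = [(total, i) for i, sample in enumerate(samples)
            for total in group_sums(sample, q)]
    ranks = [[] for _ in samples]
    for rank, (_, i) in enumerate(sorted(sums), start=1):
        ranks[i].append(rank)
    return ranks
```

The returned ranks can then be fed to any of the rank-ordered statistics above in place of the ranks of the individual observations, so that only pK values are sorted.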
IV.  SYSTEM DESCRIPTION


The operation of the voiced-unvoiced-silence decision system
investigated in this paper is presented in Fig. 2. The system was
designed to discriminate against wide-band noise with a uniform power
spectrum across the audio range. A bank of four band-pass filters was
used to partition the frequency spectrum into four contiguous intervals
as presented in Fig. 3. The gain of each filter was normalized such
that the average power out of each filter was equal for the white
noise case. With voiced speech present, the probability distributions
of the signal from the first two filters should have larger variances
than the last two filters as indicated by the typical spectra
represented in Fig. 4. With unvoiced speech present, the probability
distributions of the signal from the last two filters should have
larger variances than the first two filters as indicated in Fig. 5.

Under this strategy a few voiced-unvoiced decisions are likely to fail
with front vowels similar to [i] which have strong second and third
formants between 3 and 4 kHz. The partitioning of the audio spectrum
by the filter bank was based upon equal contribution to the Articulation
Index and Perceptual Criteria discussed by [35]. Variations in male,
female, and children's speech were considered.

The  speech  signal  was  low-pass  filtered  to  3.2  kHz,  sampled  at 
6.67  kHz,  and  high-pass  filtered  at  approximately  200  Hz  to  remove  any 
dc  or  low-frequency  hum.  The  output  from  the  high-pass  filter  was 
formatted  into  blocks  of  100  samples  (15  ms  of  data).  Each  block  of 


Figure 3.  PARTITIONING OF THE SPEECH SPECTRUM INTO
FOUR CONTIGUOUS BANDS THAT CONTRIBUTE
EQUALLY TO ARTICULATION INDEX. THE
FREQUENCY RANGE IS 200 TO 3200 Hz.


V.  NONPARAMETRIC  DETECTOR  TRADE  STUDY 


As indicated in Section III, two test procedures were selected
for evaluation: the Kruskal-Wallis test and a simultaneous test, which
included the Mann-Whitney-Wilcoxon and Savage statistics in conjunction
with chi-squared and mixed statistics. The evaluation was based upon
correct decisions (recognition rate) for each category: voiced, unvoiced,
and silence (noise only). The data base was 20 words taken from a rhyme
file provided by Dyna Stat, Inc. [36]. The words were: gob, sue,
taunt,  nil,  boast,  jab,  cheat,  said,  gnaw,  weed,  deck,  chew,  thong, 
keep,  got,  dank,  shoes,  shag,  pool,  and  dip.  Wide-band  noise  was 
added  to  a clean  speech  recording  to  produce  signal  to  noise  ratios 
(SNR)  of  30,  20,  10,  and  0 dB.  Reference  voiced,  unvoiced,  and  silence 
classifications  for  the  data  base  were  established  by  close  visual 
inspection  of  the  waveforms  and  by  listening  tests  of  the  clean  speech. 
The  data  were  divided  into  15  ms  blocks. 
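
A minimal sketch of how the noisy test material could be prepared is given below. It is an illustration under stated assumptions, not the procedure used in the study: the SNR is computed over the whole recording, the sine tone merely stands in for a speech recording, and all function names are hypothetical.

```python
import math
import random


def mean_power(x):
    """Average power (mean square) of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals snr_db, then add it."""
    ratio = 10.0 ** (snr_db / 10.0)
    scale = math.sqrt(mean_power(speech) / (mean_power(noise) * ratio))
    return [s + scale * n for s, n in zip(speech, noise)]


def split_blocks(x, block_len=100):
    """Split into consecutive full blocks (100 samples is 15 ms at 6.67 kHz)."""
    return [x[i:i + block_len] for i in range(0, len(x) - block_len + 1, block_len)]


rng = random.Random(0)
speech = [math.sin(2 * math.pi * 200 * n / 6670) for n in range(1000)]  # stand-in signal
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]                      # wide-band noise
noisy = add_noise_at_snr(speech, noise, 10)
blocks = split_blocks(noisy)
```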

Decision  Procedure 

For each 15 ms data block, 100 samples from each of the four
filters were pooled and ranked. Each sample set was represented as
X1, X2, X3, and X4 with cumulative distribution functions F1, F2, F3,
and F4 corresponding to the contiguous filter banks starting with the
lowest frequency filter as indicated in Fig. 3. A test statistic T
for each filter was formed according to Eq. 2, 5, or 10, depending on
which test procedure was being evaluated. A critical value T_α
corresponding to a 5 percent false-alarm rate (type I error) was selected.
The null hypothesis H0: F1 = F2 = F3 = F4 was tested. If T < T_α for all
four filters, the hypothesis was accepted and the decision made that
noise only (silence) was present. If T > T_α for any filter, then H0
was rejected, and it was concluded that the signal was either voiced
or unvoiced. If the test statistics from more than one filter were
greater than T_α, then only the largest T was considered. The voiced
decision was made if the largest T > T_α was from the first or second
filter. The unvoiced decision was made if the largest T > T_α was from
the third or fourth filter.
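
The decision rule described above can be sketched as follows. The function name is hypothetical, and two details are assumptions because the text does not specify them: a statistic exactly equal to the threshold is treated as not exceeding it, and a tie for the largest statistic is resolved in favor of the lower-frequency filter.

```python
def classify_block(t_values, t_alpha):
    """Classify one 15 ms block from the four per-filter test statistics.

    t_values: statistics for filters 1 to 4, lowest frequency band first.
    t_alpha:  critical value for the chosen false-alarm rate.
    """
    if len(t_values) != 4:
        raise ValueError("expected one statistic per filter (four in total)")
    largest = max(t_values)
    if largest <= t_alpha:
        # No filter exceeds the threshold: accept the null hypothesis.
        return "silence"
    # Only the largest statistic is considered; index() returns the first
    # (lowest-frequency) filter when several share the maximum.
    winner = t_values.index(largest)
    return "voiced" if winner < 2 else "unvoiced"


# Illustrative statistics (not data from the study), threshold 3.30.
print(classify_block([1.0, 0.5, 0.2, 0.1], 3.30))  # silence
print(classify_block([4.0, 1.0, 1.0, 1.0], 3.30))  # voiced
print(classify_block([3.5, 1.0, 1.0, 5.0], 3.30))  # unvoiced
```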


VI.  TEST  RESULTS 


Preliminary tests were conducted to establish a testing
strategy. The Mann-Whitney simultaneous test was conducted on three
words and a 4.5-second noise file to determine if a significant
nonzero mean value existed in the amplitude data. The hypothesis that
the mean value is zero could not be rejected at the 95 percent level
(α = 0.05). It was concluded that short-term 15 ms data blocks at 100
samples per filter output would not produce any significant nonzero
mean value (all data were high-pass filtered with a stop band 0 to
200 Hz). The Mann-Whitney-Wilcoxon statistic was discontinued at this
point in favor of the Savage statistic which theoretically is more
sensitive to voiced speech.

The Savage statistic was tested on the 4.5-second noise only
file using the mixed procedure. The amplitudes of 100 samples from
each filter were grouped into n = 20 sets of 5 each. The average of
each group was ranked and used to form a Savage statistic. The
calculated mean was 19.97 compared to the theoretical mean of 20, Eq. 11.
The calculated variance was 5.97 (with a standard deviation of 0.56)
compared to a theoretical variance of 3.77, Eq. 12, which was promising.
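
The grouping and ranking steps of the mixed procedure can be sketched as below. Several points are assumptions and are not taken from the report: "amplitude" is read as absolute value, the standard textbook Savage scores a(i) = 1/N + 1/(N-1) + ... + 1/(N-i+1) are used in place of Eq. 10 (which may be defined or normalized differently), and all function names are hypothetical. Under this assumed definition the scores average to 1, so the null expectation of the score sum for 20 group averages is 20.

```python
import random


def group_means(samples, group_size=5):
    """Average |x| over consecutive groups (100 samples give 20 averages)."""
    return [sum(abs(v) for v in samples[i:i + group_size]) / group_size
            for i in range(0, len(samples), group_size)]


def savage_scores(n_total):
    """Savage scores a(i) = sum of 1/j for j = N-i+1 .. N, for ranks i = 1 .. N."""
    scores, acc = [], 0.0
    for i in range(1, n_total + 1):
        acc += 1.0 / (n_total - i + 1)
        scores.append(acc)
    return scores


def savage_statistics(filter_outputs, group_size=5):
    """Pool the group averages of all filters, rank them, and return the
    sum of Savage scores belonging to each filter."""
    means = [group_means(x, group_size) for x in filter_outputs]
    # Sort pooled averages; ties (unlikely for continuous data) fall back
    # to filter order.
    pooled = sorted((m, k) for k, ms in enumerate(means) for m in ms)
    scores = savage_scores(len(pooled))
    totals = [0.0] * len(filter_outputs)
    for rank0, (_, k) in enumerate(pooled):
        totals[k] += scores[rank0]
    return totals


rng = random.Random(1)
noise_only = [[rng.gauss(0.0, 1.0) for _ in range(100)] for _ in range(4)]
t = savage_statistics(noise_only)
```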

The preliminary tests continued by comparing the mixed Savage
to the full rank (100 ranked samples per filter) Savage simultaneous
decision procedure on three words. No significant differences were
observed in making the voiced-unvoiced-silence decision. Values of
T_α = 3.30 and 2.39 corresponding to α' = 0.0083 (Eq. 13, K = 4) were used for
the decision threshold for the mixed and full rank cases, respectively.
This test was repeated using a mixed versus a full rank Kruskal-Wallis
test procedure. Likewise no significant differences were observed.
Values of T_α = 18.1 and 9.48 corresponding to α = 0.05 were used for
the decision threshold for the mixed and full rank cases, respectively.
Since fewer calculations are required with the mixed statistic, the
full rank method was discarded.
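
As a quick arithmetic check of the per-comparison level quoted above, the sketch below assumes that α' is obtained by dividing the overall level α = 0.05 by the number of pairwise comparisons among K = 4 filters. Eq. 13 is not reproduced in this section, so this reading is an assumption, and the variable names are illustrative.

```python
from math import comb

alpha = 0.05          # overall false-alarm rate
K = 4                 # number of filters
pairs = comb(K, 2)    # number of pairwise comparisons, K(K-1)/2
alpha_prime = alpha / pairs

print(pairs, round(alpha_prime, 4))
```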

Continuing, the decision was made to complete the tests by
comparing the recognition rates of the mixed Savage simultaneous test
to the mixed Kruskal-Wallis multiple test on the 20 words from the
rhyme file. Tables I and II present the recognition rates. Data
reported as a dash (-) indicate that either no unvoiced sounds occurred
in the corresponding word or a computer failure occurred. Only
recognition rates are reported; these are the complements of the type I
and II errors. The complement of the silence recognition rate is the
type I error, and the average complement of the voiced and unvoiced
recognition rates is the type II error.
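
The relation between the tabulated recognition rates and the two error types can be sketched as follows. The function name is hypothetical and the example rates are illustrative values, not entries from Tables I and II.

```python
def error_rates(silence_pct, voiced_pct, unvoiced_pct):
    """Return (type I, type II) error percentages from recognition rates.

    Type I is the complement of the silence recognition rate; type II is
    the average of the complements of the voiced and unvoiced rates.
    """
    type_i = 100.0 - silence_pct
    type_ii = ((100.0 - voiced_pct) + (100.0 - unvoiced_pct)) / 2.0
    return type_i, type_ii


# Illustrative rates only: 95% silence, 80% voiced, 60% unvoiced.
type_i, type_ii = error_rates(95.0, 80.0, 60.0)
```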
Percent Recognition

Silence

Word         SNR (dB):   30     20     10      0

Gob                       -      -      -      -
Sue                       -    100    100    100
Taunt                   100    100    100    100
Nil                     100    100    100    100
Boast                   100    100    100    100
Jab                     100    100    100    100
Cheat                   100    100    100    100
Said                    100    100    100    100
Gnaw                    100    100    100    100
Weed                    100    100    100    100
Deck                     91     91     91     91
Chew                    100    100    100    100
Thong                   100    100    100    100
Keep                    100    100    100    100
Got                     100    100    100    100
Dank                    100    100    100    100
Shoes                   100      -    100    100
Shag                    100    100    100    100
Pool                    100    100    100    100
Dip                     100    100    100    100

Average %                99    100    100    100

Voiced 


46 
100 
95  78 

89  46 

94  72 

57  38 

88  82 
89  52 

92  67 

93  69 

73  55 

93  86 

86  84 

82  71 

74  57 
72  ' 39 


1 1 III] 


81  50 

92  86 

74  30 

84  65 


Unvoiced 


30  ! 20  10  0 


1 

0 I 0 

0 0 
50  25 

71  29 


45  100  0 

: 41 

; 56  100  33 

' 14  100  75 

82  77  71 

i 30  1 
22 

I45 

| 27  43  14 

! 31  86  86 


100  67 


67  100  0 

16  100  91 

30 
26 

37  80  45 


1 


0 0 
71  29 

33  0 

0 0 

83  50 

50  27 


32  14 


VII.  CONCLUSIONS 

Test results presented in Tables I and II demonstrate a level
of robustness based upon the following observations. At 30 dB SNR
speech classification can be sustained at a high recognition rate
with a single threshold T_α set by a theoretical value obtained from a
probability table. Measurements of noise power (training set) were
not used to set T_α. False-alarm rates for silence classification
(type I error) remained relatively constant as the SNR was varied, as
expected, although the rate was less than the predicted 5 percent in
most cases. The bias problem associated with multiple sample testing
accounts for this reduction. False-alarm rates for voiced and unvoiced
classifications (type II error) increased as the SNR decreased, as
expected, since T_α was set in terms of a constant type I error.

The  primary  problem  that  caused  a 10  percent  false-alarm  rate 
for  silence  classification  at  30  dB  SNR  in  the  Savage  simultaneous 
test  was  traced  to  a nonuniform  power  spectrum  in  the  background  noise 
of  the  original  speech  recordings.  The  decline  in  recognition  rates 
of  voiced  and  unvoiced  classifications  as  the  SNR  was  reduced  was 
primarily  caused  by  masking  of  the  transitions  between  speech  segments. 
Misclassification of voiced as unvoiced was rare, only occurring in
the words "weed" and "keep". No misclassification of unvoiced as
voiced occurred.

As indicated in Tables I and II, the Savage simultaneous test
was more effective in classifying voiced and unvoiced speech, whereas
the Kruskal-Wallis test was more effective in classifying noise.
Details of the tests are reported in [37].